
Introduction

The data we have chosen to examine is a California housing prices dataset from Kaggle. It contains block-level variables that could go into predicting the price of a house in California. As people who currently rent (one of us in California), we hope to one day purchase a home. Understanding a model of this data could help us identify the important factors in predicting price and judge whether homes we consider buying in the future are a good deal.

In this document we will modify and assess the data, then use our assessments to attempt to build a good model that neither overfits nor underfits the data. We hope to create a general model that, given some factors about a property, outputs an expected price.

Original Variables:

  1. Median_House_Value: Median house value for households within a block (measured in US Dollars) [$]

  2. Median_Income: Median income for households within a block of houses (measured in tens of thousands of US Dollars) [10k$]

  3. Median_Age: Median age of a house within a block; a lower number is a newer building [years]

  4. Tot_Rooms: Total number of rooms within a block

  5. Tot_Bedrooms: Total number of bedrooms within a block

  6. Population: Total number of people residing within a block

  7. Households: Total number of households, a group of people residing within a home unit, for a block

  8. Latitude: A measure of how far north a house is; a higher value is farther north [°]

  9. Longitude: A measure of how far west a house is; values are negative in California, so a more negative value is farther west [°]

  10. Distance_to_coast: Distance to the nearest coast point [m]

  11. Distance_to_LA: Distance to the center of Los Angeles [m]

  12. Distance_to_SanDiego: Distance to the center of San Diego [m]

  13. Distance_to_SanJose: Distance to the center of San Jose [m]

  14. Distance_to_SanFrancisco: Distance to the center of San Francisco [m]

Variables added by the group later in the project:

  1. dist_to_nearest_city: The numeric minimum value of variables 11 through 14 divided by 1000 to convert to km. [km]

  2. nearest_city: Categorical variable indicating which of the cities in variables 11 through 14 is the closest. [city name]

  3. near_a_city_100: A factor variable indicating whether a house is within 100 km of the nearest city. [FALSE, TRUE]

  4. near_a_city_200: A factor variable indicating whether a house is within 200 km of the nearest city. [FALSE, TRUE]


Methods

Data Analysis and Refinement

Read-in

First we need to load in the data and prepare some of the columns.

library(readr)
housing_data =  read.csv("California_Houses.csv")

To augment the data a bit, we take the four predictors Distance_to_LA, Distance_to_SanDiego, Distance_to_SanJose, and Distance_to_SanFrancisco and derive two new columns from them:

  • a factor variable nearest_city. This segments the data into regions of California and lets us check whether there is any relevance in being closer to one city vs. another.

  • a numeric variable dist_to_nearest_city. This gives us the distance to that nearest city in km.

The near-a-city indicator variables are created later from dist_to_nearest_city.

Geographic Modifications and Assessment

nearest_city = rep("", nrow(housing_data))
dist_to_nearest = rep(0, nrow(housing_data))
near_city = rep(0, nrow(housing_data))


nearest_city_options = c("LA", "San Diego", "San Jose", "San Fransisco")

for (i in 1:nrow(housing_data)) {
 subset = housing_data[i,
                       c("Distance_to_LA", 
                         "Distance_to_SanDiego", 
                         "Distance_to_SanJose", 
                         "Distance_to_SanFrancisco")]
 
 nearest_city[i] = nearest_city_options[which.min(subset)]
 dist_to_nearest[i] = min(subset) / 1000
 # near_city[i] = ifelse(dist_to_nearest[i] < 100, 1, 0)
}

housing_data$nearest_city = as.factor(nearest_city)
housing_data$dist_to_nearest_city = dist_to_nearest
# housing_data$near_a_city = as.factor(near_city)
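
The row-by-row loop above can also be written without an explicit loop. The following is a minimal sketch, assuming a small made-up data frame `toy` that stands in for housing_data. The four distance column names are taken from the code above; the distance values and the names `toy`, `city_labels`, `toy_nearest_city`, and `toy_dist_km` are invented for illustration.

```r
# Toy stand-in for housing_data; distances are invented, in metres.
toy = data.frame(
  Distance_to_LA           = c( 20000, 500000, 300000),
  Distance_to_SanDiego     = c(150000, 700000,  90000),
  Distance_to_SanJose      = c(480000,  60000, 600000),
  Distance_to_SanFrancisco = c(550000,  15000, 650000)
)

# Labels in the same order as the columns above
# (spelling matches the factor levels used in this document).
city_labels = c("LA", "San Diego", "San Jose", "San Fransisco")

dist_mat = as.matrix(toy)

# Column index of the smallest distance in each row.
nearest_idx = apply(dist_mat, 1, which.min)

toy_nearest_city = factor(city_labels[nearest_idx])

# Row-wise minimum, converted from metres to kilometres.
toy_dist_km = apply(dist_mat, 1, min) / 1000

data.frame(nearest_city = toy_nearest_city,
           dist_to_nearest_city = toy_dist_km)
```

On the full dataset this should give the same two columns as the loop, since both take the row-wise minimum over the same four distance columns.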

We will then perform a quick assessment of the variables we just created. First we will inspect what proportion of the properties is closest to each city.

data.frame(
Los_Angeles = mean(housing_data$nearest_city == "LA"),
San_Diego = mean(housing_data$nearest_city == "San Diego"),
San_Jose = mean(housing_data$nearest_city == "San Jose"),
San_Fransisco = mean(housing_data$nearest_city == "San Fransisco"))
##   Los_Angeles San_Diego San_Jose San_Fransisco
## 1      0.4759   0.09685   0.1824        0.2449

And overall numbers for nearest city:

summary(housing_data$nearest_city)
##            LA     San Diego San Fransisco      San Jose 
##          9823          1999          5054          3764
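
The counts and proportions above can also be read off a single table. A minimal sketch, assuming a small invented factor `toy_city` in place of housing_data$nearest_city:

```r
# Invented labels standing in for housing_data$nearest_city.
toy_city = factor(c("LA", "LA", "San Diego", "San Jose", "LA",
                    "San Fransisco", "San Jose", "LA"))

city_counts = table(toy_city)          # number of blocks per nearest city
city_props  = prop.table(city_counts)  # the same table as proportions

city_counts
round(city_props, 4)
```

Working from one table keeps the city names in the same (alphabetical level) order for both summaries.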

We will also run a quick sanity check, plotting latitude against longitude, to confirm that we do indeed have the correct nearest city.

plot(Latitude ~ Longitude, housing_data,
     col = nearest_city,
     pch = as.numeric(nearest_city))
legend("topright",
       legend = levels(housing_data$nearest_city),
       col = c(1:4),
       pch = c(1:4)) 

Next we will gather some data on how far data points are from the nearest city.

summary(housing_data$dist_to_nearest_city)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.4    17.2    36.1    69.5    93.1   489.6

We will also plot the distribution of distance as both a box plot and a histogram to help frame these distances.

par(mfrow = c(1,2))
boxplot(housing_data$dist_to_nearest_city,
        ylab = "Distance to Nearest City [km]",
        main = "Boxplot of Distances to Nearest\n City")

hist(housing_data$dist_to_nearest_city,
     xlab = "Distance to Nearest City [km]",
     main = "Histogram of Distances to Nearest City")

Between the summary information and the boxplot, we can see that more than 3/4 of the properties in the dataset are within 100 km of the nearest city (the third quartile is 93.1 km). We will use this to create a new factor variable called near_a_city_100, which is TRUE if a property is within 100 km of its nearest city and FALSE otherwise.

housing_data$near_a_city_100 = as.factor(housing_data$dist_to_nearest_city < 100)

We also see that the upper whisker on the boxplot above ends around the 200 km mark, so we will investigate what proportion of the data falls within 200 km.

mean(housing_data$dist_to_nearest_city < 200)
## [1] 0.9161

Given that more than 91% of the data falls within 200 km of a city, we will also create a factor variable for this threshold.

housing_data$near_a_city_200 = as.factor(housing_data$dist_to_nearest_city < 200)

Our hope here is that one of the two factor variables will serve as a sufficient demarcation line beyond which certain variables start to have differing effects on the response when we build a model. We will explore that further later.

mean(as.numeric(housing_data$near_a_city_100) - 1)
## [1] 0.7664

76.64% of data points are within 100 km of the center of the nearest city.

mean(as.numeric(housing_data$near_a_city_200) - 1)
## [1] 0.9161

91.61% of data points are within 200 km of the center of the nearest city.
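
The as.numeric(...) - 1 step used above works because the factor levels are ordered "FALSE", "TRUE", so the integer codes are 1 and 2. A small sketch with invented distances (the vector `toy_dist_km` is made up) showing that this matches taking the mean of the logical comparison directly:

```r
toy_dist_km = c(5, 40, 99, 120, 250, 18, 310, 75)  # invented distances [km]

near_100 = as.factor(toy_dist_km < 100)            # levels: "FALSE", "TRUE"
levels(near_100)

prop_from_factor  = mean(as.numeric(near_100) - 1) # codes 1/2 shifted to 0/1
prop_from_logical = mean(toy_dist_km < 100)        # TRUE counts as 1

c(prop_from_factor, prop_from_logical)
```

The logical version does not depend on the level order, which may make it the safer of the two if the factor is ever releveled.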

Having extracted what we need from the four distance-to-city variables, we will now remove them from the dataset to make plotting and analysis more manageable.

housing_data = subset(housing_data, 
                      select = -c(Distance_to_LA,
                                  Distance_to_SanFrancisco,
                                  Distance_to_SanDiego,
                                  Distance_to_SanJose))
data.frame(name = names(housing_data))
##                    name
## 1    Median_House_Value
## 2         Median_Income
## 3            Median_Age
## 4             Tot_Rooms
## 5          Tot_Bedrooms
## 6            Population
## 7            Households
## 8              Latitude
## 9             Longitude
## 10    Distance_to_coast
## 11         nearest_city
## 12 dist_to_nearest_city
## 13      near_a_city_100
## 14      near_a_city_200

General Data Analysis

With the data loaded and prepped, we want to start building the model. Before we do that, we want to check the pairs of predictor variables to see which have strong correlations. We will leave out Latitude and Longitude and represent nearest_city by color.

plot(housing_data[ , c(2:7, 10,12)],
     col = as.numeric(housing_data$nearest_city),
     main = "Plot of Every Variable vs Every Other Variable in Housing Data (some withheld)")

Several obvious collinearities jump out:
  • Tot_Rooms - Tot_Bedrooms

cor(housing_data$Tot_Rooms, housing_data$Tot_Bedrooms)
## [1] 0.9299
  • Tot_Rooms - Population
cor(housing_data$Tot_Rooms, housing_data$Population)
## [1] 0.8571
  • Tot_Rooms - Households
cor(housing_data$Tot_Rooms, housing_data$Households)
## [1] 0.9185
  • Tot_Bedrooms - Population
cor(housing_data$Tot_Bedrooms, housing_data$Population)
## [1] 0.878
  • Tot_Bedrooms - Households
cor(housing_data$Tot_Bedrooms, housing_data$Households)
## [1] 0.9798
  • Population - Households
cor(housing_data$Population, housing_data$Households)
## [1] 0.9072
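
The six pairwise calls above can be collected into one correlation matrix. A minimal sketch on simulated data: the column names match the dataset, but the values are randomly generated stand-ins, so the numbers will not match the correlations reported above.

```r
set.seed(1)
n = 200

# Simulated block-level counts; all four are driven by the number of households.
households = round(runif(n, 50, 1500))
toy = data.frame(
  Tot_Rooms    = round(households * 5   + rnorm(n, sd = 300)),
  Tot_Bedrooms = round(households * 1.1 + rnorm(n, sd = 40)),
  Population   = round(households * 3   + rnorm(n, sd = 400)),
  Households   = households
)

# One call returns every pairwise correlation among the four columns.
density_cor = cor(toy)
round(density_cor, 4)
```

On the real data, the equivalent call would be cor() on the Tot_Rooms, Tot_Bedrooms, Population, and Households columns of housing_data.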

It appears that the variables related to density are strongly positively correlated. This makes sense: as the total number of people within a block (Population) increases, so does the total number of households (Households), which in turn increases both total bedrooms (Tot_Bedrooms) and total rooms (Tot_Rooms). We are not attempting to demonstrate causation, only to show how the density indicators are linked to each other.

These variables may still have interactions that we will explore later. For instance, an area where the population density is low and the number of rooms is high, or where the number of households is low but the number of rooms is high, may indicate higher house values. We will keep this in mind for later.

We will also explore variable correlations separately for properties that are close to and far from a city, to check whether any patterns emerge that were otherwise hidden.

str(housing_data)
## 'data.frame':    20640 obs. of  14 variables:
##  $ Median_House_Value  : num  452600 358500 352100 341300 342200 ...
##  $ Median_Income       : num  8.33 8.3 7.26 5.64 3.85 ...
##  $ Median_Age          : int  41 21 52 52 52 52 52 52 42 52 ...
##  $ Tot_Rooms           : int  880 7099 1467 1274 1627 919 2535 3104 2555 3549 ...
##  $ Tot_Bedrooms        : int  129 1106 190 235 280 213 489 687 665 707 ...
##  $ Population          : int  322 2401 496 558 565 413 1094 1157 1206 1551 ...
##  $ Households          : int  126 1138 177 219 259 193 514 647 595 714 ...
##  $ Latitude            : num  37.9 37.9 37.9 37.9 37.9 ...
##  $ Longitude           : num  -122 -122 -122 -122 -122 ...
##  $ Distance_to_coast   : num  9263 10226 8259 7768 7768 ...
##  $ nearest_city        : Factor w/ 4 levels "LA","San Diego",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ dist_to_nearest_city: num  21.3 20.9 18.8 18 18 ...
##  $ near_a_city_100     : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
##  $ near_a_city_200     : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
city = housing_data$near_a_city_100 == TRUE
non_city = ! city
plot(housing_data[ city, c(2:7, 10,12)],
     main = "City Based Plots",
     col = "darkgray")

plot(housing_data[ non_city, c(2:7, 10,12)],
     main = "Non-City Based Plots",
     col = "darkblue")

No further trends obviously emerge from breaking out the data into city vs non-city.

Before we move on, we will attempt to see whether any of the cities have collinearities specific to their locality. To do this we will repeat the previous step once for each city, using only the near-city data, in hopes of isolating information specific to the major cities, which cover most of the data.

levels(housing_data$nearest_city)
## [1] "LA"            "San Diego"     "San Fransisco" "San Jose"
la = housing_data$nearest_city == "LA" & city
sd = housing_data$nearest_city == "San Diego" & city
sf = housing_data$nearest_city == "San Fransisco" & city
sj = housing_data$nearest_city == "San Jose" & city

plot(housing_data[la, c(2:7, 10,12)],
     main = "Los Angeles Based Plots",
     col = 1)

plot(housing_data[sd, c(2:7, 10,12)],
     main = "San Diego Based Plots",
     col = 2)

plot(housing_data[sf, c(2:7, 10,12)],
     main = "San Francisco Based Plots",
     col = 3)

plot(housing_data[sj, c(2:7, 10,12)],
     main = "San Jose Based Plots",
     col = 4)
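
The per-city plots can be complemented with a numeric check of one correlation per city. A minimal sketch, assuming simulated data: the column names and city labels follow this document, the values are random stand-ins, and the names `toy`, `by_city`, and `city_cor` are invented for illustration.

```r
set.seed(2)
n = 120

toy = data.frame(
  nearest_city = factor(rep(c("LA", "San Diego", "San Fransisco", "San Jose"),
                            each = n / 4)),
  Households   = round(runif(n, 50, 1500))
)
toy$Population = round(toy$Households * 3 + rnorm(n, sd = 300))

# Split the rows by nearest city, then compute one correlation per group.
by_city  = split(toy, toy$nearest_city)
city_cor = sapply(by_city, function(d) cor(d$Population, d$Households))

round(city_cor, 4)
```

If a city-specific collinearity exists, it would show up as one entry of `city_cor` differing noticeably from the others.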

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# The ggpairs() calls below are commented out because the column indices changed after the columns were modified.
# ggpairs(housing_data,
#         columns = c(1, 2:5),        # Columns
#         aes(color = nearest_city,  # Color by group (cat. variable)
#             alpha = 0.5))
# ggpairs(housing_data,
#         columns = c(1, 6:9),        # Columns
#         aes(color = nearest_city,  # Color by group (cat. variable)
#             alpha = 0.5))
# ggpairs(housing_data,
#         columns = c(1, 10:13),        # Columns
#         aes(color = nearest_city,  # Color by group (cat. variable)
#             alpha = 0.5))
# ggpairs(housing_data,
#         columns = c(1, 14:15),        # Columns
#         aes(color = nearest_city,  # Color by group (cat. variable)
#             alpha = 0.5))

It looks like median age has little to do with predicting the house value, so removing it would reduce our model size. (Group note: can we hold off on cutting age? Based on what I understand about real estate and tax brackets, it may be relevant later. Though if it includes children, it might be a red herring, because kids would balance out adult ages.)

The last graph we will create for assistance is a graph of Median_House_Value vs all other predictors. We will leave the graphs somewhat sparse to allow for

par(mfrow = c(3,3))
plot(Median_House_Value ~ Median_Income, housing_data, col = 1)
plot(Median_House_Value ~ Median_Age, housing_data, col = 2)
plot(Median_House_Value ~ Tot_Rooms, housing_data, col = 3)
plot(Median_House_Value ~ Tot_Bedrooms, housing_data, col = 4)
plot(Median_House_Value ~ Population, housing_data, col = 5)
plot(Median_House_Value ~ Households, housing_data, col = 6)
plot(Median_House_Value ~ Latitude, housing_data, col = 7)
plot(Median_House_Value ~ Distance_to_coast, housing_data, col = 8)
plot(Median_House_Value ~ dist_to_nearest_city, housing_data, col = 9)

# housing_data = subset(housing_data,select = -c(Median_Age))

Model Creation and Refinement

set.seed(420)
housing_data_idx  = sample(nrow(housing_data), size = trunc(0.80 * nrow(housing_data)))
housing_data_trn = housing_data[housing_data_idx, ]
housing_data_tst = housing_data[-housing_data_idx, ]
add_model = lm(Median_House_Value ~ ., data = housing_data_trn)
int_model = lm(Median_House_Value ~ (.) ^ 2, data = housing_data_trn)
library(faraway)
## 
## Attaching package: 'faraway'
## The following object is masked from 'package:GGally':
## 
##     happy
vif(add_model)
##             Median_Income                Median_Age                 Tot_Rooms 
##                     1.841                     1.439                    13.198 
##              Tot_Bedrooms                Population                Households 
##                    39.734                     6.505                    40.066 
##                  Latitude                 Longitude         Distance_to_coast 
##                    37.838                    28.112                     5.389 
##     nearest_citySan Diego nearest_citySan Fransisco      nearest_citySan Jose 
##                     1.862                    17.038                     8.326 
##      dist_to_nearest_city       near_a_city_100TRUE       near_a_city_200TRUE 
##                     9.131                     4.521                     2.905
#mod = step(int_model, direction = "backward", trace = 0)
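
As a reminder of what the vif() values above measure: the variance inflation factor of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The following is a minimal base-R sketch on invented data. The helper name `vif_by_hand` is hypothetical, it handles numeric predictors only, and it is not the faraway implementation.

```r
set.seed(3)
n = 300

x1 = rnorm(n)
x2 = x1 + rnorm(n, sd = 0.3)  # strongly related to x1
x3 = rnorm(n)                 # unrelated to the others
toy = data.frame(x1, x2, x3)

# Hypothetical helper: regress each predictor on the others, return 1 / (1 - R^2).
vif_by_hand = function(predictors) {
  sapply(names(predictors), function(v) {
    others = setdiff(names(predictors), v)
    fit = lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

toy_vif = vif_by_hand(toy)
round(toy_vif, 3)
```

Here x1 and x2 should show large values because each is well explained by the other, while x3 should stay close to 1. The same reading applies to the large values for Tot_Bedrooms and Households above.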

Model Comparison and Selection


Results


Discussion


Appendix


Group Members

  • Brayden Turner - brturne2
  • Caleb Cimmarrusti - Calebtc2